I. Introduction

As a popular, eco-friendly form of public transportation in the 21st century, bike-share programs have been supported by local governments in major American cities. Riders can pick up a bike at one station and, for a low cost, return it to another near their destination. Bike-share programs can efficiently reduce bike-parking overflow and vehicle congestion. However, they also create a re-balancing problem: re-balancing is needed whenever the demand for bikes exceeds the supply or the supply exceeds the demand, since riders cannot pick up a bike from an empty dock or return one to a full dock. Program coordinators therefore need a plan to address re-balancing, such as transporting bikes to underutilized stations or encouraging riders to park at those stations by discounting the ride fee.

Our project uses the City of Boston as an example, predicting bike pickup availability over time and across bike-share stations. We use open bike-share data from January and February 2022, along with weather data, including temperature and precipitation, that might affect pickup demand. We chose time-lag features of 1, 2, 3, 4, and 12 hours and 1 day to capture the temporal dependencies in the time series. These features count bike-share trips over the defined historical windows, which makes the prediction more accurate and extends its time horizon.

II. Code Setup

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(gganimate)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(viridis)
library(caret)

options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

III. Exploratory Data Analysis

The bike-share data for Boston is imported from the program's official website, Blue Bikes. Since the program also covers the cities surrounding metro Boston, geographic data for these cities is retrieved from the Massachusetts government. Pickup times are floored to 15-minute and 1-hour intervals, and we add the week and day of week for each pickup. The finalized ride data is narrowed down to a 5-week range.

root.dir = "/Users/rachelren/Desktop/Upenn/MUSA_5080/HW6_Ren"
bike01 <- read.csv(file.path(root.dir,"202201-bluebikes-tripdata.csv"))
bike02 <- read.csv(file.path(root.dir,"202202-bluebikes-tripdata.csv"))
bike_rides <- rbind(bike01,bike02)
MA_cities <- st_read("https://arcgisserver.digital.mass.gov/arcgisserver/rest/services/AGOL/SurveyTowns_WebMerc_doc/MapServer/0/query?outFields=*&where=1%3D1&f=geojson")
cityList <- 
  c("ARLINGTON", "BOSTON", "BROOKLINE", "CAMBRIDGE", "CHELSEA", "EVERETT", "NEWTON", "REVERE", "SALEM", "SOMERVILLE", "WATERTOWN")
GreaterBoston <- MA_cities %>%
  filter(TOWN %in% cityList)%>%
  st_transform(crs=26986)
MBTA <- st_read("https://arcgisserver.digital.mass.gov/arcgisserver/rest/services/AGOL/MBTA_Commuter_Rail/MapServer/0/query?outFields=*&where=1%3D1&f=geojson") %>%
  st_transform(crs=26986)
rides <-
  bike_rides %>%
  mutate(
    interval60 = floor_date(ymd_hms(starttime), unit = "hour"),
    interval15 = floor_date(ymd_hms(starttime), unit = "15 mins"),
    week = week(interval60),
    dotw = wday(interval60, label = TRUE),
    X = as.numeric(start.station.longitude),
    Y = as.numeric(start.station.latitude)
  ) %>%
  filter(week %in% c(1:5))

rides_sf <- st_as_sf(rides, coords = c("X", "Y"), crs = 4326, agr = "constant")

Figure 3.1 plots the number of bike-share trips by hour in Boston. Trip counts peaked around January 25 but stayed consistently low at the end of January.

ggplot(rides_sf %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. Boston, January and February, 2022",
       x="Date", 
       y="Number of trips",
       caption = "Figure 3.1")+
  plotTheme()

Figure 3.2 shows trip volume by station for different times of day. Demand is not high in the morning rush; it peaks at mid-day, in the afternoon, and overnight. Most stations, however, see low demand.

st_drop_geometry(rides_sf) %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start.station.id, time_of_day) %>%
         tally()%>%
  group_by(start.station.id, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Boston, January and February, 2022",
       x="Number of trips", 
       y="Frequency",
       caption = "Figure 3.2")+
  facet_wrap(~time_of_day)+
  plotTheme()

Figures 3.3 and 3.4 show hourly bike-share trends for each day of the week and compare weekdays with weekends. Wednesday and Thursday carry higher volumes than Saturday. People ride most often in the afternoon and around midnight, and a sizeable number still ride late at night.

ggplot(rides_sf %>% mutate(hour = hour(starttime)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Boston, by day of the week, January and February, 2022",
       x="Hour", 
       y="Trip Counts",
       caption = "Figure 3.3")+
     plotTheme()

ggplot(rides_sf %>% 
         mutate(hour = hour(starttime),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Boston - weekend vs weekday, January and February, 2022",
       x="Hour", 
       y="Trip Counts",
       caption = "Figure 3.4")+
     plotTheme()

Figure 3.5 shows station-level demand geographically. On weekdays, bike demand is high in Cambridge, in the north-central part of the map; we suspect many MIT and Harvard students use bike-share rides to go home after studying. For the other stations, there is no significant difference between weekday and weekend.

rides_sf <- rides_sf %>%
  st_transform(crs = 26986)

ggplot() +
  geom_sf(data = GreaterBoston)+
  geom_point(
    data = rides_sf %>%
      mutate(
        hour = hour(starttime),
        weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
        time_of_day = case_when(
          hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
          hour(interval60) >= 7 &
            hour(interval60) < 10 ~ "AM Rush",
          hour(interval60) >= 10 &
            hour(interval60) < 15 ~ "Mid-Day",
          hour(interval60) >= 15 &
            hour(interval60) <= 18 ~ "PM Rush"
        )
      ) %>%
      group_by(
        start.station.id,
        start.station.latitude,
        start.station.longitude,
        weekend,
        time_of_day
      ) %>%
      tally(),
    aes(x = st_coordinates(geometry)[, 1], 
                 y = st_coordinates(geometry)[, 2], color = n),
    fill = "transparent",
    alpha = 0.4,
    size = 0.3
  ) +
  scale_colour_viridis(direction = -1,
                       discrete = FALSE,
                       option = "D") +
  ylim(min(st_coordinates(rides_sf$geometry)[, 2]),
       max(st_coordinates(rides_sf$geometry)[, 2])) +
  xlim(min(st_coordinates(rides_sf$geometry)[, 1]),
       max(st_coordinates(rides_sf$geometry)[, 1])) +
  facet_grid(weekend ~ time_of_day) +
  labs(title = "Bike share trips per hr by station. Boston, January and February, 2022",
       caption = "Figure 3.5") +
  mapTheme()

The weather data is retrieved from the Iowa Environmental Mesonet for the Boston Logan International Airport weather station across the 5 weeks. Because winter in Boston can bring snow and strong wind, weather might affect the number of bike-share rides. We created a weather panel with hourly precipitation, wind speed, and temperature. Figures 3.6, 3.7, and 3.8 show the hourly weather over the five weeks. There may have been extreme weather around January 17, when the temperature dropped to around 0°F and the wind speed was relatively high.

weather.Data <- 
  riem_measures(station = "BOS", date_start = "2022-01-01", date_end = "2022-02-05")
weather.Panel <-  
  weather.Data %>%
    mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>% 
    replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(top = "Weather Data - Boston - January & February, 2022",
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
    labs(title="Precipitation", x="Hour", y="Precipitation",caption = "Figure 3.6") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed",caption = "Figure 3.7") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature",caption = "Figure 3.8") + plotTheme())

To create a five-week space-time ride panel in which each row represents one station in one time period, we first built a study panel of every unique combination of station and hour. The ride panel then records the trip count for each time interval together with the necessary weather variables. The number of rows in the ride panel equals the number of rows in the study panel.

bos_study.panel <-
  expand.grid(
    interval60 = unique(rides_sf$interval60),
    start.station.id = unique(rides_sf$start.station.id)
  )

grouped_rides_sf <- st_drop_geometry(rides_sf) %>%
      select(
        start.station.id,
        start.station.name,
        start.station.latitude,
        start.station.longitude
      ) %>%
      distinct() %>%
      group_by(start.station.id) %>%
      slice(1)

bos_study_joined.panel <- left_join(bos_study.panel, grouped_rides_sf, by="start.station.id")
bos_ride.panel <- 
  rides_sf %>%
    mutate(Trip_Counter = 1) %>%
    right_join(bos_study_joined.panel) 

bos_grouped_ride.panel <- st_drop_geometry(bos_ride.panel) %>%
  group_by(
    interval60,
    start.station.id,
    start.station.name,
    start.station.latitude,
    start.station.longitude
  ) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T))%>%
      left_join(weather.Panel, by = "interval60")%>%
          mutate(week = week(interval60),
                 dotw = wday(interval60, label = TRUE))
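As a sanity check on the row-count claim above, the same expand.grid / right_join pattern can be exercised on toy data (hypothetical station ids and hours, not the Blue Bikes panel):

```r
library(dplyr)

# Toy data: 2 stations x 3 hours -> the completed panel should have 6 rows,
# including station-hour combinations that had zero trips.
trips <- data.frame(station = c("A", "A", "B"), hour = c(1, 2, 3))
panel <- expand.grid(hour = unique(trips$hour),
                     station = unique(trips$station),
                     stringsAsFactors = FALSE)
counts <- trips %>%
  mutate(Trip_Counter = 1) %>%
  right_join(panel, by = c("station", "hour")) %>%
  group_by(station, hour) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE), .groups = "drop")
stopifnot(nrow(counts) == nrow(panel))  # one row per station-hour
```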

We created time lags of 1, 2, 3, 4, 12, and 24 hours to capture the trip count in the hours preceding each observation.

bos_grouped_ride.panel <- 
  bos_grouped_ride.panel %>% 
  arrange(start.station.id, interval60) %>% 
  group_by(start.station.id) %>%
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
  ungroup
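Grouping by station before lagging matters: without group_by(), the first hour of one station would borrow the last hour of the previous station. A minimal sketch with toy counts (hypothetical values):

```r
library(dplyr)

# Two stations, three hours each. Lagging inside group_by(station) keeps
# each station's history separate, so station B's first hour has no
# 1-hour lag rather than inheriting station A's last value.
d <- data.frame(station = rep(c("A", "B"), each = 3),
                Trip_Count = c(5, 6, 7, 1, 2, 3))
lagged <- d %>%
  group_by(station) %>%
  mutate(lagHour = dplyr::lag(Trip_Count, 1)) %>%
  ungroup()
stopifnot(is.na(lagged$lagHour[4]))  # B's first row: no prior hour
```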

To evaluate these lags, we summarized their correlations with the current trip count in a table. The 1-hour lag has a strong positive correlation of 0.89, because demand is similar in adjacent hours, and the correlation weakens as the lag grows. The 12-hour lag is negatively correlated because demand tends to be opposite half a day apart; for example, demand at 6 am rarely resembles demand at 6 pm. The 1-day lag is again strongly positive, because demand at a given hour closely resembles demand at the same hour the previous day.

as.data.frame(bos_grouped_ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
  filter(!is.na(Value))%>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 × 2
##   Variable   correlation
##   <fct>            <dbl>
## 1 lagHour           0.89
## 2 lag2Hours         0.71
## 3 lag3Hours         0.53
## 4 lag4Hours         0.37
## 5 lag12Hours       -0.48
## 6 lag1day           0.76

We also computed the correlations between the time-lag features and trip counts for week 1 alone. None of these correlations is very strong, especially the 12-hour lag. The correlation declines with each additional hour of lag, but at 1 day it returns to a level similar to the 1-hour lag.

plotData.lag <-
  filter(as.data.frame(bos_grouped_ride.panel), week == 1) %>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                          "lag4Hours","lag12Hours","lag1day"))
correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2))

kable(correlation.lag, caption = "Table 3.9 Correlation Table of Time Lags and Trip Counts") %>%
   kable_styling()
Table 3.9 Correlation Table of Time Lags and Trip Counts

Variable     correlation
lagHour             0.46
lag2Hours           0.42
lag3Hours           0.36
lag4Hours           0.30
lag12Hours          0.04
lag1day             0.44

ggplot(plotData.lag, aes(x = Value, y = Trip_Count)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Rideshare trip count as a function of time lags",
       subtitle = "One week in January, 2022",
       x = "Lag Trip Count",
       y = "Trip Count",
       caption = "Figure 3.10")

Figure 3.11 maps the sum of trips at each station by week. The spatial pattern is fairly consistent: central Boston, including Cambridge, Charlestown, and downtown, has high trip counts, while the surrounding areas have fewer. The program should therefore consider re-balancing bikes from suburban stations toward downtown and Cambridge, especially stations near the universities.

bos_grouped_ride_panel_sf <-
  bos_grouped_ride.panel %>%
  mutate(
    X = as.numeric(start.station.longitude),
    Y = as.numeric(start.station.latitude)
  ) %>% st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")

spatial_auto <- bos_grouped_ride_panel_sf %>% 
  st_transform(crs=26986)%>%
  group_by(week, start.station.id) %>%
  summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
  ungroup()
ggplot() + 
  geom_sf(data = GreaterBoston) +
  geom_sf(data = spatial_auto, aes(color = q5(Sum_Trip_Count)), size = 1) +
  facet_wrap(~week, ncol = 3) +
  scale_color_manual(values = palette5, name = "Trip_Count") +
  labs(title = "Sum of rideshare trips by station and week", caption = "Figure 3.11") +
  mapTheme() +
  theme(legend.position = "bottom")

We also experimented with a transit-amenity feature, calculating the distance between each bike station and the nearest MBTA commuter rail line. Comparing this with the ridership maps above, the Cambridge area is farther from commuter rail yet has relatively high demand, while the surrounding areas that are closer to commuter rail show relatively lower demand.

bos_grouped_ride.panel$X <- bos_grouped_ride.panel$start.station.longitude
bos_grouped_ride.panel$Y <- bos_grouped_ride.panel$start.station.latitude
bos_grouped_ride.panel <- st_as_sf(bos_grouped_ride.panel, coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(crs=26986)

nearest_subway_distances <- st_distance(bos_grouped_ride.panel, MBTA) %>% 
  apply(1, FUN = min)

bos_grouped_ride.panel$nearest_subway_distance <- nearest_subway_distances

ggplot() +
  geom_sf(data = GreaterBoston)+
  geom_sf(data = bos_grouped_ride.panel,aes(color = nearest_subway_distance, geometry = geometry)) +
  labs(title = "Nearest distance to MBTA commuter rail",
       color = "nearest_subway_distance",
       caption = "Figure 3.12") +
  theme()
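Note that st_distance() above builds a full distance matrix over every panel row before taking row minimums. An equivalent pattern for large panels is st_nearest_feature() followed by element-wise distances; a minimal sketch with synthetic points (hypothetical coordinates, not the Blue Bikes data):

```r
library(sf)

# Three synthetic "stations" on a line and two "rail" points one unit north.
stations <- st_as_sf(data.frame(id = 1:3, x = c(0, 1, 2), y = c(0, 0, 0)),
                     coords = c("x", "y"), crs = 26986)
rail <- st_as_sf(data.frame(x = c(0, 2), y = c(1, 1)),
                 coords = c("x", "y"), crs = 26986)

# Index of the nearest rail feature per station, then element-wise distance
# (avoids computing the full n-by-m distance matrix).
nearest_idx <- st_nearest_feature(stations, rail)
dists <- as.numeric(st_distance(stations, rail[nearest_idx, ],
                                by_element = TRUE))
```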

Turning to weather, precipitation can worsen road conditions, so we expect fewer trips on rainy or snowy days than on clear days. Figure 3.13 shows how the mean trip count varies with precipitation. Although the mean count is lower when it rains or snows, the difference is not dramatic.

bos_grouped_ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(isPrecip = ifelse(Precipitation > 0,"Rain/Snow", "None")) %>%
  group_by(isPrecip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
    ggplot(aes(isPrecip, Mean_Trip_Count)) + geom_bar(stat = "identity") +
      labs(title="Does ridership vary with precipitation?",
           x="Precipitation", y="Mean Trip Count",
           caption = "Figure 3.13") +
      plotTheme()

Figure 3.14 shows the relationship between temperature and trip count by week. Although the fitted lines slope upward in most weeks, the correlation is weak and the points are widely scattered, so it is hard to claim a strong relationship between temperature and trip count.

bos_grouped_ride.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE) +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count as a function of Temperature by week",
         x="Temperature", y="Mean Trip Count",
         caption = "Figure 3.14") +
    plotTheme() 

Before running the prediction models, we split the five-week panel into a 3-week training set and a 2-week test set.

ride.Train <- filter(bos_grouped_ride.panel, week <= 3)
ride.Test <- filter(bos_grouped_ride.panel, week > 3)

IV. Modeling and Validation

We ran four OLS regressions with different time and space variables to compare predictive accuracy. The first uses only time fixed effects with temperature, the second only station fixed effects with temperature, the third combines both, and the fourth adds the time-lag features.

reg1 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature, data=ride.Train)

reg2 <- lm(Trip_Count ~  start.station.id + dotw + Temperature, data=ride.Train)

reg3 <- lm(Trip_Count ~  start.station.id + hour(interval60) + dotw + Temperature, 
                         data=ride.Train)

reg4 <- lm(Trip_Count ~  start.station.id +  hour(interval60) + dotw + Temperature +
                         lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day, 
                         data=ride.Train)

We nested the two-week test set by week and computed the mean absolute error (MAE) for each model. Figure 4.1 shows the MAE of the four regressions: the space-time-lags regression has a markedly smaller MAE than the other three, making it the most accurate and generalizable model.

ride.Test.weekNest <- 
  as.data.frame(ride.Test) %>%
  nest(-week) 
model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(A_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
           B_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
           C_Space_Time_FE = map(.x = data, fit = reg3, .f = model_pred),
           D_Space_Time_Lags = map(.x = data, fit = reg4, .f = model_pred))
week_predictions <- week_predictions %>%  
    gather(Regression, Prediction, -data, -week) %>% 
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean),
           sd_AE = map_dbl(Absolute_Error, sd))

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week",
         caption = "Figure 4.1") +
  plotTheme()

Figure 4.2 offers a clearer comparison of model accuracy. The predicted and observed hourly trip counts overlap most closely for the space-time-lags regression, although the predictions run slightly below the observed counts.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start.station.id = map(data, pull, start.station.id))%>%
  dplyr::select(interval60, start.station.id, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60, -start.station.id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = mean(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      scale_colour_manual(values = palette2) +
      labs(title = "Mean Predicted/Observed bike share by hourly interval", 
           x = "Hour", y= "Bikeshare Trips",
           caption = "Figure 4.2") +
      plotTheme()

To validate the test set across space, Figure 4.3 maps the mean absolute error for each station by week. Errors are higher in the Cambridge and Charlestown areas, which also have high bike demand, and lower in the nearby towns and suburban areas.

error.byWeek <-
  filter(week_predictions, Regression == "D_Space_Time_Lags") %>% 
  unnest() %>% 
  st_sf() %>%
  st_transform(crs = 26986)%>%
  dplyr::select(start.station.id, Absolute_Error, week, geometry) %>%
  gather(Variable, Value, -start.station.id, -week, -geometry) %>%
    group_by(Variable, start.station.id, week) %>%
    summarize(MAE = mean(Value))
ggplot() +
  geom_sf(data = GreaterBoston)+
  geom_sf(data = error.byWeek,aes(color = MAE, geometry = geometry)) +  
  facet_wrap(~ week) + 
  labs(title = "Mean Absolute Error by station by week",
       color = "MAE",
       caption = "Figure 4.3") +
  theme()

Figure 4.4 plots observed versus predicted trips by time of day for weekdays and weekends; the results show a consistent pattern of under-prediction, with the lowest accuracy in the morning.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start.station.id = map(data, pull, start.station.id), 
           start.station.latitude = map(data, pull, start.station.latitude), 
           start.station.longitude = map(data, pull, start.station.longitude),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start.station.id, start.station.longitude, start.station.latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "D_Space_Time_Lags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips",
       caption = "Figure 4.4")+
  plotTheme()

Figure 4.5 maps the mean absolute error across times of day for weekdays and weekends. In each map, areas such as Cambridge show higher MAE during the afternoon and overnight as demand rises; surges in demand are harder to predict and produce larger errors.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start.station.id = map(data, pull, start.station.id), 
           start.station.latitude = map(data, pull, start.station.latitude), 
           start.station.longitude = map(data, pull, start.station.longitude),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start.station.id, start.station.longitude, start.station.latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "D_Space_Time_Lags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start.station.id, start.station.longitude, start.station.latitude, weekend, time_of_day) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = (GreaterBoston %>% st_transform(crs = 4326)), color = "grey", fill = "transparent")+
  geom_point(aes(x = start.station.longitude, y = start.station.latitude, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(rides_sf$start.station.latitude), max(rides_sf$start.station.latitude))+
  xlim(min(rides_sf$start.station.longitude), max(rides_sf$start.station.longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set",
       caption = "Figure 4.5")+
  mapTheme()

Lastly, we performed 100-fold cross-validation over the full 5-week panel, again using the time-lag features as predictors. The R-squared is 0.42, meaning the predictors explain about 42% of the variance in trip counts. The RMSE is 0.73, meaning the predictions are, on average, about 0.73 trips away from the observed values.
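For reference, the RMSE and MAE that caret reports are just the following formulas, shown here on made-up observed and predicted values (not model output):

```r
# Toy observed/predicted vectors (hypothetical values).
obs <- c(0, 1, 2, 4)
pred <- c(0.5, 1.5, 1.5, 3.0)

rmse <- sqrt(mean((obs - pred)^2))  # penalizes large errors more heavily
mae <- mean(abs(obs - pred))        # average absolute miss
```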

control <- trainControl(method="cv", number=100)
set.seed(999)
model_cv <- train(Trip_Count ~ start.station.id + hour(interval60) + dotw + Temperature + Precipitation + lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day, 
                  data=bos_grouped_ride.panel,
                  method="lm",
                  trControl=control,
                  na.action = na.pass
                  )

print(model_cv)
## Linear Regression 
## 
## 279460 samples
##     11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 276665, 276665, 276666, 276665, 276666, 276665, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.7260187  0.4214707  0.3771357
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

V. Conclusion

Based on the regression results, the model with space, time, and time-lag features gives the most accurate and generalizable predictions. However, although the predicted trends and values are close to the observed values, the model still under-predicts. The areas with the highest bike demand generate larger absolute errors than the surrounding towns from the afternoon to the end of the day. The k-fold cross-validation R-squared is relatively low, so the fit is not strong, but the model can still inform a re-balancing plan. Program coordinators should focus on the university area in Cambridge and the downtown areas toward the end of weekdays, balancing bike supply from the nearby suburban stations.

Data References:

https://bluebikes.com/system-data